GCD computations and variants of the Euclidean algorithm enjoy broad uses in both classical
and quantum algorithms. In this paper, we propose quantum circuits for GCD computation with
$O(n \log n)$ depth and $O(n)$ ancillae. Prior circuit constructions need $O(n^2)$ running time
with $O(n)$ ancillae. The proposed construction is based on the binary GCD algorithm and
benefits from log-depth circuits for 1-bit shift, comparison/subtraction, and managing ancillae.
The worst-case gate count remains $O(n^2)$, as in traditional circuits.
I. INTRODUCTION

The development and optimization of specific quantum circuits is primarily
viewed from the perspective of quantum algorithms, in the sense that many
quantum models of computation are defined in terms of quantum circuits. In
this context, circuit blocks arising in specific quantum algorithms deserve
particular attention. Such blocks sometimes implement well-known classical
algorithms, but must ensure reversibility, judicious use of ancillae, the
restoration of pre-initialized values, and reasonable resource optimization.

Circuit blocks studied in this work encompass GCD computations and variants
of the Euclidean algorithm, which enjoy broad uses in both classical and
quantum algorithms. Classical modular-inverse computations and
continued-fraction expansions use similar algorithms. Reversible GCD circuits
have been successfully used in quantum algorithms for extracting square-free
factors of large integers using Gauss sums [1] and solving Pell's equation
[2]. These algorithms offer significant quantum speed-up. GCD circuits also
form the core of algorithms for number factoring based on Gauss sums [3], but
these algorithms have been less competitive than other techniques so far
[4, 5]. Other applications include elliptic-curve arithmetic and solutions of
the discrete-logarithm problem [6, 7]. GCD circuits are also attractive as
benchmarks for quantum arithmetic, as they are smaller than modular
exponentiation circuits [8].

In this paper, we propose $O(n \log n)$-depth, $O(n^2)$-size quantum circuits
for GCD computation with $O(n)$ ancillae. Prior constructions result in
$O(n^2)$ running time with $O(n)$ ancillae [1], [7]. The remaining part of
this paper is organized as follows. We introduce background concepts on
quantum circuits in Section II. In Section III, theoretical background for
GCD computation is discussed. This section includes an introduction of the
simple Euclidean algorithm and its extended version, as well as the binary
GCD algorithm, which is the one used in this paper. Prior circuit structures
are reviewed in Section IV. The $O(n \log n)$-depth circuit structure for GCD
computation is proposed in Section V, and Section VI concludes the paper.



II. BACKGROUND ON QUANTUM CIRCUITS 

A quantum bit (qubit) can be treated as a mathematical object that represents
a quantum state with two basic states $|0\rangle$ and $|1\rangle$. It can
carry a linear combination $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$
of its basic states, called a superposition, where $\alpha$ and $\beta$ are
complex numbers and $|\alpha|^2 + |\beta|^2 = 1$.

A matrix $U$ is unitary if $UU^\dagger = I$, where $U^\dagger$ is the
conjugate transpose of $U$ and $I$ is the identity matrix. An $n$-qubit
quantum gate performs a $2^n \times 2^n$ unitary operation $U$ on $n$ qubits
in a specific period of time. For a gate $g$ with a unitary matrix $U_g$, its
inverse gate $g^{-1}$ implements the unitary matrix $U_g^{-1}$. Two gates can
be executed in parallel if they share neither control(s) nor target(s). Given
any unitary gate $U$ over $m$ qubits, a controlled-$U$ gate with $k$ control
qubits can be defined as an $(m + k)$-qubit gate that applies $U$ on the $m$
qubits if and only if all $k$ control qubits are $|1\rangle$. Additionally:

• A multiple-control Toffoli gate $C^m\mathrm{NOT}(x_1, \ldots, x_{m+1})$
passes the first $m$ qubits unchanged. These qubits are referred to as
controls. This gate flips the value of the $(m+1)$-th qubit if and only if
each positive (negative) control line carries the 1 (0) value. For
$m = 0, 1, 2$ the gates are called NOT, CNOT, and Toffoli, respectively.

• A multiple-control Fredkin gate $\mathrm{Fred}(x_1, x_2, \ldots, x_{m+2})$
has two target lines $x_{m+1}, x_{m+2}$ and $m$ control lines
$x_1, x_2, \ldots, x_m$. The gate interchanges the values of the targets if
the conjunction of all $m$ positive (negative) controls evaluates to 1 (0).
For $m = 0, 1$ the gates are called SWAP and Fredkin, respectively.
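As an illustration, the action of these two gate families on classical basis states can be modeled in a few lines of Python. This is only a sketch: the function names `mc_toffoli` and `mc_fredkin`, the list-of-bits state, and the `negative` argument (the set of controls that fire on 0) are assumptions made here for illustration and are not notation from this paper.

```python
def _controls_fire(bits, controls, negative):
    """True if every positive control is 1 and every negative control is 0."""
    return all(bits[c] == (0 if c in negative else 1) for c in controls)


def mc_toffoli(bits, controls, target, negative=()):
    """Multiple-control Toffoli on a basis state given as a list of 0/1 values.

    With no controls this is NOT, with one control CNOT, with two Toffoli.
    """
    out = list(bits)
    if _controls_fire(bits, controls, negative):
        out[target] ^= 1
    return out


def mc_fredkin(bits, controls, target_a, target_b, negative=()):
    """Multiple-control Fredkin: swap the two targets when all controls fire.

    With no controls this is SWAP, with one control the Fredkin gate.
    """
    out = list(bits)
    if _controls_fire(bits, controls, negative):
        out[target_a], out[target_b] = out[target_b], out[target_a]
    return out
```

For example, `mc_toffoli([1, 1, 0], [0, 1], 2)` flips the last bit, while `mc_fredkin([0, 0, 1], [0], 1, 2)` leaves the state unchanged because the control is 0. Both functions are their own inverses, as expected for reversible gates.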

In all circuit diagrams, horizontal lines are variables, vertical lines are
gates, and time flows left to right. Additionally, • (or ∘) is used for
conditioning on the qubit being set to value '1' (or '0'), ⊕ is used to
denote the target line of a multiple-control Toffoli gate, and × is used on
qubits of a SWAP (or a controlled SWAP) gate. Fig. 1a shows a SWAP gate,
which can be implemented by three CNOT gates as shown in Fig. 1b. Adding one
control to a SWAP gate (Fig. 1c) results in a Fredkin gate, which can be
implemented with three Toffoli gates (Fig. 1d) or one Toffoli gate and two
CNOTs (Fig. 1e).

Figure 1: SWAP gate (a) can be constructed by 3 CNOTs (b). A conditional SWAP
(Fredkin) (c) can be implemented by three Toffoli gates (d) or one Toffoli and
two CNOTs (e).
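The decompositions summarized for Fig. 1 can be checked exhaustively on classical basis states. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the gate orderings are one standard choice that realizes the stated gate counts, not necessarily the exact drawings in the figure.

```python
from itertools import product


def cnot(state, control, target):
    s = list(state)
    s[target] ^= s[control]
    return s


def toffoli(state, c1, c2, target):
    s = list(state)
    s[target] ^= s[c1] & s[c2]
    return s


def swap_from_cnots(state, a, b):
    """SWAP(a, b) as three CNOTs with alternating control and target."""
    s = cnot(state, a, b)
    s = cnot(s, b, a)
    return cnot(s, a, b)


def fredkin_from_toffolis(state, c, a, b):
    """Fredkin as three Toffolis: the CNOTs above, each given the extra control c."""
    s = toffoli(state, c, a, b)
    s = toffoli(s, c, b, a)
    return toffoli(s, c, a, b)


def fredkin_from_one_toffoli(state, c, a, b):
    """Fredkin as CNOT, Toffoli, CNOT: only the middle gate needs the control."""
    s = cnot(state, b, a)
    s = toffoli(s, c, a, b)
    return cnot(s, b, a)


def verify_decompositions():
    """Compare every decomposition against its definition on all 3-bit inputs."""
    for bits in product((0, 1), repeat=3):
        c, a, b = bits
        swapped = [c, b, a]
        expected = swapped if c else [c, a, b]
        if swap_from_cnots(bits, 1, 2) != swapped:
            return False
        if fredkin_from_toffolis(bits, 0, 1, 2) != expected:
            return False
        if fredkin_from_one_toffoli(bits, 0, 1, 2) != expected:
            return False
    return True
```

Checking basis states suffices here because all gates involved are permutations of basis states, so agreement on every basis input implies agreement as unitaries.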

The lines which are added to a quantum circuit are named ancillae.[14] We use
zero-initialized ancillae in this work. The zero-initialized ancillae may be
modified inside a given circuit, but should be returned to zero at the end of
the computation to be reused. The number of qubits, which includes both main
qubits and ancilla registers, is very limited in current quantum technologies.



III. GREATEST COMMON DIVISOR 

Algorithms discussed in this paper perform integer arithmetic, which can be
described with C/C++ operators.

• / for integer division, e.g., 10 / 6 = 1

• % for the remainder operation, e.g., 10 % 6 = 4

In particular, n/2 shifts the bits of n to the right by one position, and
n%2 = 0 checks if n is even. As illustrated in Fig. 2a, the n/2 operator
(1-bit shift) can be implemented by a cascade of SWAP gates. This can be
verified by exchanging the lines involved in each SWAP gate.
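The swap cascade can be traced on a classical bit vector. The following sketch assumes a least-significant-bit-first list layout and uses hypothetical helper names; it models a circular shift as a chain of adjacent swaps and shows that, for even inputs, the result equals integer division by two.

```python
def to_bits(value, width):
    """Little-endian bit list: index 0 holds the least significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def circular_right_shift(bits):
    """Circular 1-bit right shift as a cascade of adjacent swaps.

    Each swap moves the original least significant bit one position up,
    so it ends in the most significant position.
    """
    out = list(bits)
    for i in range(len(out) - 1):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

With 8 lines, the input 10 maps to 5. For an odd input such as 11, the low bit wraps around to the top, giving 133 = 5 + 128, which is why the shift acts as n/2 only when the least significant bit is 0.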

The greatest common divisor (GCD) of two integers $A$ and $B$ can be found by
the Euclidean algorithm, which performs successive division with remainder,
given that for $A = Bq + r$ with all positive numbers,
$\gcd(A, B) = \gcd(B, r = A \% B)$. The extended Euclidean algorithm
additionally finds integers $x$ and $y$ that satisfy Bézout's identity
$Ax + By = \gcd(A, B)$. For coprime $A$ and $B$, $x$ is the multiplicative
inverse of $A$ modulo $B$, and $y$ is the multiplicative inverse of $B$ modulo
$A$. This modular inverse $A^{-1} \equiv x \pmod{B}$ enjoys applications in
various fields including cryptography.
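For reference, a minimal classical sketch of the extended Euclidean algorithm is given below. It is the textbook iterative formulation, not a reversible circuit, and the function names `extended_gcd` and `mod_inverse` are chosen here for illustration.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        # Each remainder is a known integer combination of a and b.
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def mod_inverse(a, modulus):
    """Multiplicative inverse of a modulo modulus, for coprime inputs."""
    g, x, _ = extended_gcd(a, modulus)
    if g != 1:
        raise ValueError("inverse exists only for coprime inputs")
    return x % modulus
```

For instance, `extended_gcd(10, 6)` returns `(2, -1, 2)`, since 10·(−1) + 6·2 = 2.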

Figure 2: Circular 1-bit shift on 8 qubits. (a) Straightforward
implementation. (b) Constant-depth implementation. All SWAP gates in dashed
boxes can be executed in parallel.

Figure 3: An outline of the binary GCD algorithm (one step). A backslash (\)
on a horizontal line represents a multiqubit bus. The added ancillae are used
to evaluate $A < B$, $B \% 2 = 0$, $A \% 2 = 0$, and $A \neq 0$,
respectively. The last $n$-qubit register is used to hold the value of
$R = \gcd(A, B)$ at each step. A • (or ∘) on an $n$-qubit register denotes a
conditional operation. The input, output, and ancilla registers for each block
are specified by 'i', 'o', and 'a' on the related lines, respectively.

The Binary GCD Algorithm [9], also called Stein's algorithm, computes the GCD
of two nonnegative integers $A$ and $B$ using subtractions and divisions by
two, which are easy to implement in hardware. The algorithm maintains two
numbers, starting with $A$ and $B$, but replaces them at every step with a
pair that has the same GCD. The following steps are repeated until either
$A = B$ or $A = 0$.

• If $A \% 2 = B \% 2 = 0$, $\gcd(A, B) = 2\gcd(A/2, B/2)$

• If $A \% 2 = 0 = 1 - B \% 2$, $\gcd(A, B) = \gcd(A/2, B)$

• If $A \% 2 = 1 = 1 - B \% 2$, $\gcd(A, B) = \gcd(A, B/2)$

• If $A \% 2 = B \% 2 = 1$, then we ensure that $A > B$, and use
$\gcd(A, B) = \gcd\left(\frac{A - B}{2}, B\right)$

The last branch is performed with a single test ($A < B$) that controls
$\mathrm{Fred}(A, B)$, followed by $A = \frac{A - B}{2}$. The binary GCD
algorithm is outlined in Fig. 3. In this figure, the register $R$ is used to
save the intermediate GCD value at each step. Initially $R = 1$, and at each
step $R = 2 \cdot R$ if $A \% 2 = B \% 2 = 0$. After the last GCD iteration,
$R \cdot B$ computes the result. Note that the comparison blocks may need $n$
zero-initialized ancillae to compute the result, but the resulting value is a
single bit.

For $n$-bit numbers, each step takes linear time, given that comparison,
subtraction, and circular shift have linear-size circuits. $O(n)$ steps are
followed by an $O(n^2)$-gate multiplication $B \cdot R$. Thus, the binary GCD
algorithm needs $O(n^2)$ time. On average, it uses 60% fewer bit operations
than the Euclidean algorithm [10], but does not improve asymptotic
performance. Similar to the extended Euclidean algorithm, an extended binary
GCD algorithm is suggested in [3, p. 338 & p. 646], which performs
subtractions rather than more general divisions with remainder. The removal of
factors of two is irreversible, but can be implemented by circular shifts that
move trailing zeros to the most significant bits. Such an arrangement still
requires clearing control values. The construction proposed in this work is
based on the binary GCD algorithm.
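The steps of the binary GCD algorithm described in this section, including the register $R$ that accumulates common factors of two, can be sketched classically as follows. This is an irreversible reference model, not the circuit. The early return for $B = 0$ is an assumption added here, since the stated loop condition alone would not terminate on that input.

```python
def binary_gcd(a, b):
    """Binary (Stein) GCD following the four branches described in the text."""
    if b == 0:
        # Guard added for this sketch: gcd(a, 0) = a.
        return a
    r = 1  # accumulates the common power of two
    while a != b and a != 0:
        a_even = a % 2 == 0
        b_even = b % 2 == 0
        if a_even and b_even:
            a, b, r = a // 2, b // 2, 2 * r
        elif a_even:
            a //= 2
        elif b_even:
            b //= 2
        else:
            # Both odd: one comparison controls the swap, then subtract and halve.
            if a < b:
                a, b = b, a
            a = (a - b) // 2
    return r * b
```

For example, `binary_gcd(12, 18)` passes through the pairs (6, 9), (3, 9), (9, 3) after the swap, and (3, 3), with $R = 2$, and returns 6.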



IV. PRIOR WORK 

Our work focuses on GCD and related computations for integers, rather than
for polynomials over binary fields [6]. To implement binary GCD by a quantum
circuit, [1] used three extra $n$-qubit ancilla registers (see Fig. 3 for an
outline) to (1) check the termination condition ($A = B$ or $A = 0$) after
each step, (2) verify whether $A$ and $B$ are even or not, and (3) check
$A < B$. Each step of the algorithm performs even/odd and greater/less
comparisons. The maximum possible number of steps should be implemented by
explicit circuit blocks, as the actual number of steps depends on $A$ and
$B$. This path was pursued in [1], which leads to $O(n^2)$ runtime. In [7],
the authors proposed a quantum circuit for the extended Euclidean algorithm
with $O(n^2)$ time complexity and $O(n)$ space.
Applying the method for the binary extended Euclidean 
algorithm leads to In + e qubits and a running time of 
$O(n^2)$ [7]. The authors did not clear all zero-initialized
ancillae, which limits the applicability of their techniques.



V. GCD CIRCUITS WITH O(n log n) DEPTH

Each step of the binary GCD algorithm includes several data-dependent
branches. Given that quantum circuits must work correctly with superposition
states, all branches must be implemented explicitly and the longest possible
execution trace must be supported. Such a trace includes $n$ steps, each of
which performs either a single subtraction or divisions by two. In GCD
computation, a 1-bit circular right shift can implement the division-by-two
operator, as the least significant line holds 0 whenever a division-by-two is
called. Otherwise, one needs to exclude one line from the rest of the
computation each time. When circuit depth is considered, one can use
log-depth adder/subtractor circuits with $O(n)$ ancillae [11], and the
conventional implementation of a shift as a sequence of swaps becomes a
bottleneck.

To implement a logarithmic-depth circuit for GCD computations, we use ideas
from [12], which has not considered GCD computations, but studied parallel
quantum circuits. The authors point out that any fixed bit-permutation can be
implemented with $O(1)$ depth using $n$ zero-initialized ancillae in four
layers: copying the bits to ancillae in parallel, canceling the originals,
copying the ancillae, and then canceling the ancillae. Consider Fig. 4a,
which illustrates the permutation cycle (0,3,1,2) with 4 ancillae. This
circuit transforms $x_0$ to $x_3$, $x_3$ to $x_1$, $x_1$ to $x_2$, and finally
$x_2$ to $x_0$. On the other hand, a depth of six layers can be achieved with
no ancillae by dealing with each cycle individually and decomposing it into a
product of two sets of disjoint swaps. Consider a $k$-cycle[15]
$\sigma_k = (1, 2, 3, 4, 5, \ldots, k)$. Then for
$\pi_k = (1,2)(k,3)(k-1,4)\ldots$ note that $\pi_k = \pi_k^{-1}$. If $k$ is
odd, $\pi_k$ will have one fixed point, but it can anyway be implemented by
$k/2$ parallel swaps. Furthermore, note that $\rho_k = \sigma_k \pi_k$ has a
similar cycle structure and can also be implemented by parallel swaps.
Therefore, by implementing $\rho_k$ and $\pi_k$ with disjoint swaps, we
implement $\sigma_k$ in constant depth. Fig. 4b illustrates the permutation
cycle in Fig. 4a without ancillae. In (b), the first two SWAP gates construct
the permutation (0,1)(1,2) and the third SWAP constructs (0,3). Following
this path, a depth-2 single-bit circular shift is shown in Fig. 2b, which
includes the transpositions (0,1)(2,7)(3,6)(4,5)(0,2)(3,7)(4,6).

Figure 4: Implementation of any fixed bit-permutation by a constant-depth
quantum circuit [13]. Constructing the permutation cycle (0,3,1,2) with depth
4 with ancillae (a), and with depth 6 without ancillae (b). In (b), each SWAP
gate needs 3 CNOT gates for implementation. Gates in dashed boxes can be
executed in parallel.
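Both constructions can be illustrated on classical values. In the sketch below, a cyclic shift is written as two layers of disjoint swaps, each layer being a reflection of the indices. The four-layer ancilla method is modeled with XOR operations standing in for CNOT gates. The function names and the convention `perm[i]` = destination of bit `i` are assumptions made for this example.

```python
def reflection_swaps(n, c):
    """Disjoint transpositions (i, j) with i + j = c (mod n) and i < j."""
    return [(i, (c - i) % n) for i in range(n) if i < (c - i) % n]


def apply_swaps(values, swaps):
    out = list(values)
    for i, j in swaps:
        out[i], out[j] = out[j], out[i]
    return out


def rotate_two_layers(values):
    """Circular shift by one position as two layers of parallel swaps."""
    n = len(values)
    first = apply_swaps(values, reflection_swaps(n, 1))
    return apply_swaps(first, reflection_swaps(n, 2))


def permute_with_ancillae(x, perm):
    """Four-layer permutation; every layer acts on disjoint (control, target) pairs."""
    n = len(x)
    x = list(x)
    anc = [0] * n
    for i in range(n):          # layer 1: copy the bits to the ancillae
        anc[i] ^= x[i]
    for i in range(n):          # layer 2: cancel the originals
        x[i] ^= anc[i]
    for i in range(n):          # layer 3: copy each ancilla to its destination
        x[perm[i]] ^= anc[i]
    for i in range(n):          # layer 4: cancel the ancillae
        anc[i] ^= x[perm[i]]
    return x, anc
```

For 8 lines, `reflection_swaps(8, 1)` and `reflection_swaps(8, 2)` reproduce the transpositions (0,1)(2,7)(3,6)(4,5) and (0,2)(3,7)(4,6) quoted above. Applying the two layers in the opposite order shifts in the other direction.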

The work in [12] points out that $\sim n$ gates controlled by a shared bit
(fanout) cannot be applied in parallel directly, but illustrates a
straightforward technique that copies the control value to $n$ ancillae with
depth $\log n$ and clears the ancillae after their use. This approach is
illustrated in Fig. 5, where the initial and final CNOT gates are used to
prepare and clear the added ancillae, respectively. This adds
$2 \log n + 1$ latency. However, the main circuit block, which applies the
conditional unitaries, is parallelized to depth 1.
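A doubling schedule is one way to obtain such a logarithmic-depth fanout. The sketch below is an illustration under stated assumptions: wire 0 carries the control, the remaining wires are zero-initialized ancillae, and the helper names are hypothetical. Each layer lets every wire that already holds the control value copy it to one fresh wire, so the number of layers is $\lceil \log_2 m \rceil$ for $m$ wires in total.

```python
def fanout_layers(num_wires):
    """Layers of (source, target) CNOT pairs; pairs within a layer are disjoint."""
    layers = []
    filled = 1  # wires 0 .. filled-1 currently hold the control value
    while filled < num_wires:
        new = min(filled, num_wires - filled)
        layers.append([(src, filled + src) for src in range(new)])
        filled += new
    return layers


def run_layers(layers, wires):
    """Apply each CNOT layer to a list of classical bits."""
    out = list(wires)
    for layer in layers:
        for src, dst in layer:
            out[dst] ^= out[src]
    return out
```

Running the layers forward spreads the control value. Running them in reverse order clears the ancillae again, which corresponds to the final CNOT gates described above.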

Figure 5: $\log n$-depth implementation of shared control (fanout) [13].

Following Fig. 3, each step of the binary GCD algorithm may include a single
conditional subtraction and/or a single-bit conditional shift. The
$A \% 2 = 0$ and $B \% 2 = 0$ blocks can be implemented unconditionally,
since they only check whether $A$ and/or $B$ are even without modifying the
values of the $A$ and $B$ registers. Similarly, $A < B$ can be computed
unconditionally. The conditional 1-bit shift on $A$ when $A \% 2 = 0$ can
also be applied even if $A = 0$. This simplifies the second circular 1-bit
shift operation in Fig. 3. To implement the conditional $\frac{A - B}{2}$,
note that one of the conditionals is on $A \neq 0$. If $A = 0$, then
$A \% 2 = 0$, and $\frac{A - B}{2}$ is not applied. Accordingly,
$\frac{A - B}{2}$ can be computed with a single conditional. The result of
these optimizations is shown in Fig. 6. Additionally,

• The unconditional comparisons $A < B$ and $A \neq 0$ can be implemented by
circuits with logarithmic depth with $O(n)$ cleared ancillae [11].

• The conditional subtraction $A - B$ can be implemented by a circuit with
logarithmic depth with $O(n)$ cleared ancillae. This can be done by following
the circuit structure in [11], and replacing CNOT and NOT gates on output
lines by Toffoli and CNOT gates, respectively.

• The circuit for $A \% 2 = 0$ (and $B \% 2 = 0$) includes a single CNOT
conditioned on the last bit of $A$ (and $B$).

• Swapping two $n$-qubit registers $A$ and $B$ can be done in one step by
applying $n$ SWAP gates on disjoint qubits in parallel. Conditional
$\mathrm{Fred}(A, B)$ can be implemented with $\log n$ depth and $n$
ancillae: a log-depth circuit to replicate the conditional on $n$ ancillae
and a circuit with depth 1 for $\mathrm{Fred}(A, B)$. All ancillae can be
cleared.

• Unconditional bit shift can be implemented with a constant-depth circuit.
For conditional shift, one can use $n$ ancillae to replicate the control in
$O(\log n)$ time. Accordingly, conditional shift can be parallelized to
$O(\log n)$ depth. All ancillae can be cleared since the conditional remains
unchanged.
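The conditional register swap from the list above can be modeled on classical bits to see that the ancillae return to zero and to tally gates. This is a sketch under assumptions: the function name and the gate-count dictionary are hypothetical, each Fredkin gate is expanded into two CNOTs and one Toffoli, and the fanout is written as a plain loop, so only its gate count and logical effect are modeled, not its log-depth layout.

```python
def conditional_register_swap(control, reg_a, reg_b):
    """Swap two equal-length bit registers if control is 1.

    Returns the new registers, the final ancilla values, and a gate tally.
    """
    n = len(reg_a)
    a, b = list(reg_a), list(reg_b)
    anc = [0] * n
    gates = {"cnot": 0, "toffoli": 0}

    # Replicate the control onto n ancillae (n CNOTs in total).
    for i in range(n):
        anc[i] ^= control
        gates["cnot"] += 1

    # One Fredkin per bit position, each with its own copy of the control.
    for i in range(n):
        a[i] ^= b[i]             # CNOT
        b[i] ^= anc[i] & a[i]    # Toffoli
        a[i] ^= b[i]             # CNOT
        gates["cnot"] += 2
        gates["toffoli"] += 1

    # Clear the ancillae; the control itself is unchanged, so this undoes the copy.
    for i in range(n):
        anc[i] ^= control
        gates["cnot"] += 1

    return a, b, anc, gates
```

In this model, a swap of two $n$-bit registers uses $4n$ CNOT and $n$ Toffoli gates. Because every Fredkin gate has a private control copy, all $n$ of them act on disjoint lines.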

Table I reports the values of gate count and circuit depth for different
circuit blocks. In this table, the numbers of CNOT and Toffoli gates are
reported independently as a [#CNOT; #Toffoli] pair. Values for comparison and
conditional subtraction can be obtained by following the circuit structures,
depths, and sizes given in [11] and the notes above. For conditional SWAP,
note that $2n$ CNOT gates (with depth $2\lceil\log n\rceil$) are used to
prepare and clear the ancilla register. Each Fredkin gate can be implemented
by two CNOT gates and one Toffoli gate as illustrated in Fig. 1e, and there
are $n$ parallel Fredkin gates in total. Therefore, circuit depth can be
computed as $2\lceil\log n\rceil + 2$ CNOTs and one Toffoli. Similarly,
circuit size is $4n$ CNOTs and $n$ Toffoli gates. To count the number of
gates for the conditional 1-bit circular shift, note that $2n$ CNOTs (with
depth $2\lceil\log n\rceil$) are used to prepare and clear ancillae, and the
remaining $n - 1$ Fredkin gates can be implemented with constant depth (i.e.,
4 CNOT and 2 Toffoli gates) and linear size (i.e., $2n - 2$ CNOT and $n - 1$
Toffoli gates). Altogether, the conditional 1-bit circular shift circuit
needs $4n - 2$ CNOT and $n - 1$ Toffoli gates with depth
$2\lceil\log n\rceil + 4$ CNOT and 2 Toffoli gates. Considering the values
given in Table I and the circuit structure in Fig. 6 reveals that each step
of the GCD computation can be implemented by a log-depth and linear-size
circuit.

Figure 6: Restructuring the binary GCD algorithm of Fig. 3 (one step). The
input, ancilla, and output registers for each block are specified by 'i',
'a', and 'o' on the related lines, respectively.

To compute the final GCD, a multiplication $R \cdot B$ is applied after all
steps, where $R$ is a power of two. Multiplication by $R$ can be done by a
circular shift.[16] Since the value of $R$ is computed during the GCD
iterations, we use $n$ controlled shifts by $2^i$ ($i < n$).[17] These
power-of-two shifts can be performed in any order, but the conventional
quantum-circuit model does not allow parallel execution of gates operating on
the same qubits. Since $R$ is a power of two in the GCD computation, only one
of the controlled shifts will be applied. Hence, all controlled power-of-two
shifts may be applied simultaneously on the same targets. A controlled shift
operation can be implemented in $O(\log n)$ depth with $O(n)$ ancillae.
Accordingly, the last multiplication of $B$ by $R$ can be implemented with a
log-depth, quadratic-size circuit.[18]
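The final multiplication can be illustrated classically. The sketch below reads "shift by $2^i$" as multiplication by $2^i$, that is, a circular shift by $i$ positions controlled on bit $i$ of $R$. That reading, the helper names, and the fixed register width are assumptions of this example. The circular shift equals multiplication only while the product fits in the register.

```python
def rotate_left(value, shift, width):
    """Circular left shift of a width-bit value."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def multiply_by_power_of_two(b, r, width):
    """Multiply b by r, where r is a power of two held in a width-bit register.

    One controlled shift per bit of r; since r has a single 1 bit, exactly
    one of the controlled shifts is actually applied.
    """
    result = b
    for i in range(width):
        if (r >> i) & 1:          # control: bit i of r
            result = rotate_left(result, i, width)
    return result
```

With an 8-bit register, `multiply_by_power_of_two(3, 4, 8)` yields 12.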

To count ancillae, note that all computational ancillae are cleared inside
each block. After the final multiplication block for $R \cdot B$, one can copy
(in log depth) the final GCD result to another $n$-qubit zero-initialized
register and apply the whole circuit (except for copying the result) in
reverse order to recover $A$, $B$, and zero-initialized ancillae. Given that
all components use $O(n)$ ancillae (see Table I), the total number of
ancillae remains linear.
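The compute, copy, uncompute pattern described above can be sketched generically on a classical register file. The example is a toy under stated assumptions: the register names, the 8-bit width, and the two reversible steps (a modular addition and an XOR) are chosen only to show the pattern and are unrelated to the actual GCD blocks.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def add_b_into_a(s):
    s["a"] = (s["a"] + s["b"]) & MASK


def sub_b_from_a(s):
    s["a"] = (s["a"] - s["b"]) & MASK


def xor_a_into_t(s):
    s["t"] ^= s["a"]


# Each step is a (forward, inverse) pair; XOR is its own inverse.
STEPS = [(add_b_into_a, sub_b_from_a), (xor_a_into_t, xor_a_into_t)]


def compute_copy_uncompute(state, steps, src, dst):
    """Run all steps, copy src into the zeroed dst, then undo the steps in reverse."""
    s = dict(state)
    for forward, _ in steps:
        forward(s)
    s[dst] ^= s[src]              # copy of the result survives the uncompute
    for _, inverse in reversed(steps):
        inverse(s)
    return s
```

Starting from `{"a": 200, "b": 100, "t": 0, "out": 0}`, the call `compute_copy_uncompute(state, STEPS, "t", "out")` leaves `a` and `b` restored and `t` back at zero, while `out` holds 44, the sum modulo 256.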

Considering the worst-case number of iterations $n$ to find the GCD of two
$n$-bit numbers $A$ and $B$, binary GCD computation can be implemented with
an $O(n \log n)$-depth, $O(n^2)$-size quantum circuit and $O(n)$ ancillae.



Table I: Size, depth, and ancillae in different circuit blocks. The numbers of CNOT and Toffoli gates are reported separately
as a [#CNOT; #Toffoli] pair. For comparison and subtraction blocks we used the method in [11]. In those cases, the number
of ones in the binary expansion of $n$ is represented by $w(n)$. Prior constructions [1], [7] use linear-size and linear-depth circuits
with $O(n)$ ancillae for each step of the GCD computation, whereas we use linear-size and log-depth circuits with $O(n)$ ancillae.

| Block | Size | Depth | Ancillae | Reference |
|---|---|---|---|---|
| Comparison | $[2n - 2;\ 6n - w(n-1) - 2\lfloor\log(n-1)\rfloor - 7]$ | $[2;\ 2\lfloor\log n\rfloor + 5]$ | $2n - \lfloor\log(n-1)\rfloor - 3$ | [11] |
| Conditional subtraction | $[2n;\ 14n - 11]$ | $[2;\ 3\lfloor\log(n-1)\rfloor + \lfloor\log(\ldots)\rfloor + 16]$ | $2n - 2$ | [11] |
| Conditional 1-bit circular shift | $[4n - 2;\ n - 1]$ | $[2\lceil\log n\rceil + 4;\ 2]$ | $n$ | This work |
| Conditional SWAP | $[4n - 2;\ n - 1]$ | $[2\lceil\log n\rceil + 4;\ 2]$ | $n$ | This work |



VI. CONCLUSION 

We demonstrated reversible controlled circular-shift circuits with $\log n$
depth and $O(n)$ ancillae. Using these circuits, we proposed
$O(n \log n)$-depth quantum circuits for GCD computation.

The Euclidean algorithm finds the greatest common divisor in $O(n^2)$ time.
However, it is unknown whether this can be accomplished in
$O(\log^{c_1} n)$ time using $O(n^{c_2})$ parallel processors (for constants
$c_1, c_2$). Notably, parallel algorithms faster than the Euclidean algorithm
have been proposed. The fastest known deterministic classical algorithm
solves the problem in $O(n / \log n)$ time with $n^{1+\epsilon}$ processors
[13]. We do not try to make such parallel GCD constructs reversible, and
these techniques require significant overhead, including many ancillae and
large circuits. Finding a sharper bound on quantum-circuit depth for GCD
computation using a reasonable number of gates and ancillae is an interesting
open question.